-
Notifications
You must be signed in to change notification settings - Fork 900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect and log long running http(s) requests #17842
Detect and log long running http(s) requests #17842
Conversation
@Fryguy @carbonin @bdunne @NickLaMuro please review with brutal honesty... This middleware is run on every request so any and all calculations/slow operations need to be minimized. |
@NickLaMuro well, as coded (currently), this PR enables this by default since you might not be able to get to the UI to enable it if your puma threads are hung up. I'm willing to make this optional if the overhead is too much but ideally, we should just log this information whenever it happens. |
@jrafanie no enabling necessary, you can run it against an existing $ rbspy report -p 12345 |
ah, @kbrock @dmetzger57 probably have opinions on this too |
@NickLaMuro My bad, I was talking about this PR. It's on by default. |
Nope, totally get that, just trying suggesting alternatives that might avoid more maintenance for us. |
lib/request_started_on_middleware.rb
Outdated
# thread after we set one or more of the above local variables. The fallout | ||
# of this is we return a false positive for a request that finished very close | ||
# to the 2 minute timeout. | ||
if request.present? && started_on.kind_of?(Time) && REQUEST_TIMEOUT.ago > started_on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this supposed to use the class method rather than the constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀 YES! Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring for testing and forgetting to call the new method... It's a constant problem.
I updated the description with a logging example. |
8df16a6
to
3d9cf2e
Compare
log_long_running_requests | ||
end | ||
|
||
CHECK_LONG_RUNNING_REQUESTS_INTERVAL = 30.seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
who knows if every 30 seconds is too frequent or not
return if @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago | ||
|
||
RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread| | ||
message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I debated having the middleware method return the PID:TID format and rounded duration but chose to keep raw data there. I can be convinced to change it. 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does seem like this would be part of rails rails itself. Or part of new relic / scout / skylight
This will be nice to proactively track stuff down.
Is there a way to make this chartable rather than dumping into another log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can see making this more generic (it assumes a threaded web server like puma).
We had looked at an alternative, https://github.com/heroku/rack-timeout, which warns about the implications of raising a timeout like it does and how it's not for production... more for debugging... This PR was not meant to take action other than notify us of a long running request. We do not care if the threads are deadlocked or just doing too many things slowly, we just log the long running request. I honestly don't know that we ever want to take action other than log/notify people.
So, yes, if this works out for us, I can see making this a small standalone rack middleware with a generic interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this let us know about hung requests too?
This looks nice
|
||
# For testing: mocking Thread.list feels dangerous | ||
private_class_method def self.relevant_thread_list | ||
Thread.list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think this represents your race condition
return if @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago | ||
|
||
RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread| | ||
message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does seem like this would be part of rails rails itself. Or part of new relic / scout / skylight
This will be nice to proactively track stuff down.
Is there a way to make this chartable rather than dumping into another log?
Yes. I actually called this "hung requests" at first but since this detects all long running requests, including deadlocked ones and ones that just take really long to complete, I thought the wording was clearer as "long running requests."
Thanks, I'm hoping this will give us more clarity when people encounter 502 errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My two cents on the topic.
For what it is worth, I have done some of this in ManagePerformance
as well already.
lib/request_started_on_middleware.rb
Outdated
start_request(env['PATH_INFO'], Time.now.utc) | ||
response = @app.call(env) | ||
ensure | ||
complete_request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like you could avoid a lot of complexity with this by just logging as part of the middleware.
def call(env)
start_time, end_time = [Time.now, nil]
begin
@app.call(env)
ensure
log_to_request_time_dot_log(start_time, end_time) if too_long?
end
end
The extra Thread
seems like a much larger over complication of what you are trying to do here, and I am unsure why it is needed in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, the thread part is the complicated part but is not without good reason. The problem is that the thread is hung and sometimes doesn't ever finish because it's deadlocked, killed before finishing, or finishes much later than the user gets the 502 error. In all three of these cases, we won't get to the log_to_request_time_dot_log
method until after (if ever) the user has captured or investigated the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't ever finish because it's deadlocked, killed before finishing, or finishes much later
Fair enough. I have some points below, but this is mostly just followup, and my real status on the review is here:
So for these three:
- "deadlocked": valid
- "killed before finishing": true
- "finishes much later than the user gets the 502": Well, this would still get logged, the data from the request would just end up being sent into the void.
That said, with the first two (which the first, the deadlock would probably lead to the second happening eventually), I want to say that I have seen when I have killed a process (at least with a Ctrl-C
), that the "Completed" message still ends up in the logs. I have to do some digging into the Rails source a bit more to see how ActiveSupport::LogSubscriber
handles control signals and allow this to happen, since that seems to be what is handling this here.
Curious if some form a rescue
/at_exit
would allow this to work, but since you are trying to be proactive long term, this might just be more of a curiosity/musing for myself than anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree, the "eventually" logs Completed
is something we can parse logs for after the fact... I've seen too many logs where they never Completed
due to deadlock, killed before finishing or the user captured the logs before it was logged.
The ONLY reason to do this "clever" in memory thread local variable thing is because we don't get the log message and therefore can't trust that we'll get to the rack request completion point in the thread and thus we need to use another thread to track long requests. If there is another way, I'm open to suggestions.
lib/request_started_on_middleware.rb
Outdated
def call(env) | ||
begin | ||
start_request(env['PATH_INFO'], Time.now.utc) | ||
response = @app.call(env) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can just be @app.call(env)
(without the response
var). ensure
doesn't mess with the the result of the block/method:
$ ruby -e "puts begin; 'foo'; ensure; puts 'hi'; end"
hi
foo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I'll fix this.
Also worth noting, I recently uploaded some code for parsing this information from the existing logs: (in gist form which is a single file and probably a bit easier to grok). Might not hurt to add something that simply prints out dangling requests that don't have a Specifically, I was using this when debugging the BZ for ManageIQ/manageiq-ui-classic#4447 since the app was also 502'ing and even getting reaped by the |
I think this is valuable to have regardless. I'm not sure where it should live but if it can parse PID:TID combinations spanning multiple compressed files (which it looks like it can do) to find each request that never completed or took too long, that would be extremely valuable. That said, I was hoping to be more proactive in detecting this in process unless we believe this is problematic or too much overhead. Perhaps not now, but it would be nice to be able to react to a hung request at some point in the future. |
lib/request_started_on_middleware.rb
Outdated
# thread after we set one or more of the above local variables. The fallout | ||
# of this is we return a false positive for a request that finished very close | ||
# to the 2 minute timeout. | ||
if request.present? && started_on.kind_of?(Time) && request_timeout.ago > started_on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this loop take a long time?
do we want to keep calculating the request_timeout.ago
or pre calculate it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, good idea... there's no reason to keep calculating this. Time calculation is slow, we should do it once at the top 👍
Yup, which is why I have it where it is at for now just so there is somewhere people can pull it from and make use of it. And if we find a place for it to live cool, happy to move it.
I see, didn't realize that was the case, so that changes my perspective a bit. I will need to do some thinking and further digging into what you have here to form a better opinion. I admittedly, have only glanced at this so far, so I probably will need some more time with this to give you a better answer. For what it is worth, my ❌ is not meant to be a blocker, so if I don't get back to my review of this, then don't let that hold you up. |
Yeah, this isn't something I want to force through.. I'd prefer we all have time to review it and are happy with it. Take your time. |
I'm fine with this conceptually, though I haven't fully reviewed the code. I am surprised there isn't a gem that just already does this. I understand that rack-timeout takes it a step further, but it still feels like a wheel that's already been invented. |
I've googled for rack long request, rack timeout request, rails long request...
Perhaps, a middleware as @NickLaMuro suggested with an ensure block would help us (without the thread local variables) in all cases except when the threads are deadlocked or the process gets SIGKILL'd. |
I'll make some of the suggested changes and we can decide what we want to do here. |
https://bugzilla.redhat.com/show_bug.cgi?id=1614918 This middleware tracks each request and when it started using thread local variables. long_running_requests traverses these thread local variables to find requests beyond the 2 minute timeout. This information can be retrieved by another thread, such as the heartbeating thread of a puma worker, which can then respond to a timed out request.
3d9cf2e
to
2f4bcb5
Compare
lib/request_started_on_middleware.rb
Outdated
|
||
def self.long_running_requests | ||
requests = [] | ||
timed_out_request_started_on = request_timeout.ago.utc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks Keenan for this idea ^
lib/request_started_on_middleware.rb
Outdated
|
||
def call(env) | ||
start_request(env['PATH_INFO'], Time.now.utc) | ||
@app.call(env) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NickL, thanks for this simplification
Ok, I pushed a few changes suggested by @NickLaMuro and @kbrock... I didn't actually run this again to verify I didn't break something... at this point, I'd like to an idea if this is a direction we want to go. I'm concerned that if we don't track this in process and instead rely on the rack request to complete, we'll miss hung threads or threads brutally killed. Maybe we can improve both? Do we think this overhead is too great to run in production? |
https://bugzilla.redhat.com/show_bug.cgi?id=1614918 For puma based workers, we implement do_heartbeat_work to log_long_running_requests. This method is run very frequently per minute, so we ensure this check happens at most every 30 seconds. The log message includes the requested URI path, the thread (in the same PID:TID format that we log elsewhere) and the request duration so far. Using the PID:TID information, users can then find the request parameters by searching the logs for this PID:TID.
2f4bcb5
to
61680c1
Compare
I'm ok with this, the config.middlware stuff is a dsl and looks better without parens |
@NickLaMuro what do you think... I've summarized the two approaches: Rack middleware that does an
|
|
||
def start_request(path, started_on) | ||
Thread.current[:current_request] = path | ||
Thread.current[:current_request_started_on] = started_on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the thread local vars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I admit part of this review is just to remove the X
from my original review (EDIT: Bah... didn't work). Mostly minor suggestions and tweaks.
So, from what I can tell, if you switch to using env
for the :current_request_started_on
, you basically have no object allocations on a per request basis. And since you are hooking into do_heartbeat_work
, there really is no extra overhead that wasn't really already thread from how we run our workers (and the .log_long_running_requests
and RequestStartedOnMiddleware.long_running_requests
calls are as good as you can get, and would basically be what you would want to log anyway, regardless of the technique used (this versus a ensure
based approach).
Compared to what I suggested (ensure
), this has the advantage of logging multiple times for the same request, and being able to make use of the tread.backtrace
if we wanted (see specific comments below regarding that).
lib/request_started_on_middleware.rb
Outdated
|
||
def self.long_running_requests | ||
requests = [] | ||
timed_out_request_started_on = request_timeout.ago.utc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable name seems a bit wordy and confusing when read in the if
below.
Maybe something like timeout_threshold
or allowable_request_start_time
, and then the if reads like:
if ... && started_on < allowable_request_start_time
# ...
end
# There's a race condition where the complete_request method runs in another | ||
# thread after we set one or more of the above local variables. The fallout | ||
# of this is we return a false positive for a request that finished very close | ||
# to the 2 minute timeout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note for the future: This becomes a bigger issue if we were "react to a hung request" as you suggested here, but for now, this really isn't a big deal.
Logging something that was a "close to a 2 min request" is going to anger just about no one (except @kbrock, who probably thinks we log too much as it is... which is fair, but not relevant to my point).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NickLaMuro 😆 🤣 @kbrock
lib/request_started_on_middleware.rb
Outdated
end | ||
|
||
def call(env) | ||
start_request(env['PATH_INFO'], Time.now.utc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pedantic, but can we extract this Time.now
value from the env
instead of looking it up ourselves? Does it makes sense to make another Time.now
allocation, or should use what is there?
Honestly asking here though, because I am curious if you think there is an advantage or disadvantage to one or the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I was thinking you're right... maybe we should use the one in the headers but then I was looking at how newrelic pulls the times out of the header and it looks faster to do it the way I already have it. Check it out here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, yeah, I think what I have is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, I was under the impression that rack
/puma
would have added this info for us, but I guess not.
That said, Rails does do this, which lead me down a rabbit hole of investigation into "how", and going to spike on something real quick as a POC just for you to let me know what you think. This "idea" probably won't be superior in features, and is digging into the Rails private API a bit, but might be less code for us to maintain overall. We'll see...
lib/request_started_on_middleware.rb
Outdated
# of this is we return a false positive for a request that finished very close | ||
# to the 2 minute timeout. | ||
if request.present? && started_on.kind_of?(Time) && timed_out_request_started_on > started_on | ||
duration = (Time.now.utc - started_on).to_f |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can "cache" Time.now
in a local variable above with request
and started_on
instead of looking it up every iteration of the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also something I noticed just now, is the .utc
calls done through out these changes even necessary? We never log the raw time, and only the difference between the start and current time when logging long requests. I only mention this because we are trying to keep it as minimal/lightweight as possible, and calls to .utc
seem pointless when dealing with time diffs.
Might be something that rubocop
complained about though, so just something I noticed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Yeah, since we're writing and reading the time, there's no need to store it as utc as long as it's always local
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, we're only running Time.now
once for each long request, which shouldn't be that frequent. The alternative is to do it once for EVERY request.
return if @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago | ||
|
||
RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread| | ||
message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to log the thread.backtrace
here as well? It would help debugging engineers find out where the bad parts of the codebase exist. If this is really a hung process, multiple checks on this on the 30 second intervals would let us know if we are looping on the same section of code on each check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? I don't know if it would be too noisy though. I hear @kbrock thinks we log too much
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record: We can (and should only) do this in a followup, this was just something I thought was nice about this approach that we are able to do.
Maybe? I don't know if it would be too noisy though.
So my thought with this, and others, was:
- Ideally, no requests go over 10 seconds ever.
- When they do, and we have a reproducible environment, with zero changes (or maybe a settings tweak+restart) we can hit a slow request and git the UI workers on a regular interval with this signal (
SIGHUP
possibly or something unused by puma).
So going back to the "stacktrace" portion, this could maybe only be something that is only turned on when we are using a control signal, and the "light mode" is what is on all the time.
This could even be scripted so that once the "bad url" is determined, it can be hit, and another thread waits 1 minute, and then starts hitting the running UI workers with this signal. The whole script would run until some form of a result is returned from the server (502 or otherwise).
Anyway, just some ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, no requests go over 10 seconds ever.
🤣
yeah, we can do the backtrace with some signal but we have to be careful of puma's signal handling like you mentioned.
@jrafanie So I am taking some time to re-look over this and mull it over in my head, but to start with your original question:
I have just a few corrections/additions while I continue to do some thinking and looking over what you have here:
|
Removing this 'X' for now. Don't think it is sending the right message at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I forgot in my last review.
lib/request_started_on_middleware.rb
Outdated
requests | ||
end | ||
|
||
REQUEST_TIMEOUT = 2.minutes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two quick suggestions about this (which I didn't think to add into my last review):
- Maybe rename this to
LONG_REQUEST
, or something, since we really aren't "timing out" anything here. It is based on the apache timeout, but this simply logs a long request, and this is the value for what constitutes that. - Possibly make this only
1.minute
? On the performance team, we were told to try and make anything over 10 seconds in the UI faster, so waiting until 2.minutes to say something is "long" seems like it is a bit extreme. When paired with my suggestion to possibly include the.backtrace
with it, it seems like this would allow us to more quickly see when something is long and causing issues, and if the issue is moving from one code path to the next. And most requests (ideally) shouldn't get to this point where it is a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on both. I like the LONG_REQUEST
name... it makes more sense. I was thinking we could expose the 2.minutes
value because I also see value in changing it. Maybe 1.minute
across the board will be better as it will pick up long requests that don't currently timeout apache but also requests that do. I'll change it to 1.minute
and we'll go from there.
Sorry @jrafanie, you are getting 📖ed by me today... So I was thinking of a few things we could do with this as alternatives or "in addition to" what we have here. Most are later TODO's, and really shouldn't be considered for this PR, but I think are worth discussing:
I bring up the first point in particular because I don't know what you have been doing "to verify this" [ref] in the past (edit: looks like you did describe how in the BZ), but it would be nice if it wasn't just adding a Things like bloating memory, long unoptimized SQL queries, actual deadlocks, etc., that were reproducible and could help validate this code better. Here is a few examples of some I have done (mostly SQL related, so no deadlock examples):
|
I'll look. I'm not sure how to reliably do this without first rigging the system by either giving puma less time to do the request or make the request long via sleep.
Maybe? Log all threads or just the slow requests?
We can't catch SIGKILL via at_exit. We can do other signals though. Is this a problem at_exit? Normally, our problem is the silly ui worker stays up too long while all 5 of it's threads are hung/busy so premature killing of the worker hasn't been a problem in my experience. |
Checked commits jrafanie/manageiq@cf34c1f~...b53091e with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0 config/application.rb
|
Detect and log long running http(s) requests (cherry picked from commit cc6e263) https://bugzilla.redhat.com/show_bug.cgi?id=1614918
Hammer backport details:
|
When a request is made through apache, we have 2 minutes to complete this request before apache throws a timeout, resulting in a browser 502 error. This pull request provides a mechanism for people who encounter these 502 errors to figure out which request and parameters are causing this problem. Typically, the request is really slow and needs to be optimized, but we have had deadlock(s) occur so it's very helpful to minimize the work required to narrow the problem down.
https://bugzilla.redhat.com/show_bug.cgi?id=1614918
This middleware tracks each request and when it started using thread
local variables.
long_running_requests
traverses these thread localvariables to find requests beyond the 2 minute timeout. This
information can be retrieved by another thread, such as the heartbeating
thread of a puma worker, which can then respond to a timed out request.
For puma based workers, we implement
do_heartbeat_work
tolog_long_running_requests
. This method is run very frequently per minute,so we ensure this check happens at most every 30 seconds.
The log message includes the requested URI path, the thread (in the same
PID:TID format that we log elsewhere) and the request duration so far.
Using the PID:TID information, users can then find the request
parameters by searching the logs for this PID:TID.
Example logging:
Note, above, I have 2 requests that take 3 minutes to finish. Each one is logged twice.